The Effects of Language Relatedness on Multilingual Information Retrieval: A Case Study With Indo-European and Semitic Languages
نویسندگان
چکیده
We explore the effects of language relatedness within a multilingual information retrieval (IR) framework which can be deployed to virtually any language, focusing specifically on Indo-European versus Semitic languages. The Semitic languages present unique challenges to IR for a number of reasons, so we set out to answer the question of whether cross-language IR for Semitic languages can be boosted by manipulation of the training data (which, in our framework, includes multilingual parallel text, some of which is morphologically analyzed). We attempted three measures to achieve this: first, the inclusion of genetically related (i.e., other Semitic) languages in the training data; second, the inclusion of non-related languages sharing the same script, and third, the inclusion of morphological analysis for Semitic languages. We find that language relatedness is a definite factor in boosting IR precision; script similarity can probably be ruled out as a factor; and morphological analysis can be helpful, but – perhaps paradoxically – not necessarily to the languages which are subjected to morphological analysis.
منابع مشابه
Early Phonological and Lexical Development of a Farsi Speaking Child: A Longitudinal Case Study
The present study aims at the description and analysis of the phonological and lexical development of a child who is acquiring Farsi as his first language. The child's language production at the holophrastic stage of language development, mainly single words, is observed and recorded longitudinally for nearly seven months since he was 16 months old until he turned 23 months. An attempt is mad...
متن کاملAre root letters compulsory for lexical access in Semitic languages? The case of masked form-priming in Arabic.
Do Semitic and Indo-European languages differ at a qualitative level? Recently, it has been claimed that lexical space in Semitic languages (e.g., Hebrew, Arabic) is mainly determined by morphological constraints, while lexical space in Indo-European languages is mainly determined by orthographic constraints (Frost, Kugler, Deutsch, & Forster, 2005). One of the key findings supporting the quali...
متن کاملModifying a Natural Language Processing System for European Languages to Treat Arabic in Information Processing and Information Retrieval Applications
The goal of many natural language processing platforms is to be able to someday correctly treat all languages. Each new language, especially one from a new language family, provokes some modification and design changes. Here we present the changes that we had to introduce into our platform designed for European languages in order to handle a Semitic language. Treatment of Arabic was successfull...
متن کاملAdapting a resource-light highly multilingual Named Entity Recognition system to Arabic
We present a working Arabic information extraction (IE) system that is used to analyze large volumes of news texts every day to extract the named entity (NE) types person, organization, location, date and number, as well as quotations (direct reported speech) by and about people. The Named Entity Recognition (NER) system was not developed for Arabic, but instead a highly multilingual, almost la...
متن کاملComparative Study of Degree of Bilingualism in Lexical Retrieval and Language Learning Strategies
This study compares lexical retrieval amongst monolinguals and intermediate bilinguals and advanced bilinguals. It also investigates the possible effects of their language learning strategies on their respective lexical retrieval advantage. The study used a mixed methods design and the groups consisted of 20 Persian near-monolinguals, 20 Persian-English intermediate level bilinguals, and 20 Per...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2008